Model tuning and stability
Interpretability
Application to case study
\(\lambda\) is a hyperparameter that controls model complexity.
As \(\lambda \uparrow \infty\), all coefficients shrink toward zero. As \(\lambda \downarrow 0\), we return to OLS.
We can study the order in which variables enter the model as we gradually decrease \(\lambda\). We expect the most important features to enter first.
Q: Can you tell if these coefficient paths \(\hat{\beta}_{j}\left(\lambda\right)\) come from ridge vs. lasso regression?
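As a hint for this question, here is a small Python sketch (scikit-learn's `Lasso` and `Ridge` as stand-in solvers; the data and \(\lambda\) grid are made up) contrasting the two kinds of paths: lasso coefficients sit at exactly zero for large \(\lambda\), while ridge coefficients shrink but remain nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 0.0])  # only two true signals
y = X @ beta_true + rng.normal(size=n)

# Decreasing penalty grid (note: sklearn's alpha scaling differs from glmnet's lambda)
lambdas = np.logspace(1, -2, 20)
lasso_paths = np.array([Lasso(alpha=lam).fit(X, y).coef_ for lam in lambdas])
ridge_paths = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

# Lasso sets coefficients to exactly zero for large lambda;
# ridge only shrinks them toward zero.
print("lasso zeros:", (lasso_paths == 0).sum())
print("ridge zeros:", (ridge_paths == 0).sum())
```

Exact zeros along the path are the signature of the \(\ell^1\) penalty.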
What is an appropriate “budget” for model complexity?
Decreasing \(\lambda\) too far allows too many noise features into the model; increasing \(\lambda\) too far misses true signals.
We need to tune \(\lambda\) to balance these competing issues and achieve good performance on holdout samples.
Split the data into \(K\) folds (e.g., \(K = 5\)). Fit the model on \(K−1\) folds and test it on the remaining fold. This mimics the setting of gathering new data and testing the model on that data.
We can compute performance on the holdout folds across \(\lambda\) complexity parameters. For each \(k = 1, \dots, K\), we compute
\[\begin{align*} CV_{k}\left(\lambda\right) = \frac{1}{\left|I_{k}\right|} \sum_{i \in I_{k}} \left(y_{i} - \hat{y}_{i}^{-k}\left(\lambda\right)\right)^2 \end{align*}\] where \[\begin{align*} \hat{y}_{i}^{-k}\left(\lambda\right) := x_{i}^\top \hat{\beta}^{-k}\left(\lambda\right) \end{align*}\] and \(\hat{\beta}^{-k}\left(\lambda\right)\) solves the lasso optimization at hyperparameter \(\lambda\) using data from all folds except \(I_{k}\).
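The computation above can be sketched in Python, with scikit-learn's `Lasso` as a stand-in solver for \(\hat{\beta}^{-k}(\lambda)\); the data, fold count, and \(\lambda\) grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)  # one true signal feature

lambdas = np.logspace(0, -3, 10)
K = 5
cv_err = np.zeros((K, len(lambdas)))  # entry (k, j) is CV_k(lambda_j)

folds = KFold(n_splits=K, shuffle=True, random_state=0).split(X)
for k, (train, test) in enumerate(folds):
    for j, lam in enumerate(lambdas):
        fit = Lasso(alpha=lam).fit(X[train], y[train])  # beta^{-k}(lambda)
        resid = y[test] - fit.predict(X[test])          # y_i - yhat_i^{-k}(lambda)
        cv_err[k, j] = np.mean(resid ** 2)

cv_mean = cv_err.mean(axis=0)  # average holdout error per lambda
```

Averaging `cv_err` over folds gives the usual U-shaped cross-validation curve.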
The initial decrease is the phase where removing noise features improves performance. The later increase happens when we start ignoring real signal features.
\(\lambda_{\text{min}}\): Minimizes the cross-validation error.
\(\lambda_{\text{1se}}\): The simplest model, i.e., largest \(\lambda\), whose error is within one standard error of the minimum. This is a sparser model with comparable performance to the best.
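Both rules can be read off the fold-by-\(\lambda\) error matrix. A numpy-only sketch (the U-shaped `cv_err` matrix below is hypothetical, standing in for real cross-validation output):

```python
import numpy as np

lambdas = np.logspace(0, -3, 20)  # decreasing penalty grid
rng = np.random.default_rng(2)
# Hypothetical fold-by-lambda holdout errors with the typical U-shape
cv_err = 1.0 + (np.log10(lambdas) + 1.5) ** 2 \
    + rng.normal(scale=0.05, size=(5, 20))

mean_err = cv_err.mean(axis=0)
se_err = cv_err.std(axis=0, ddof=1) / np.sqrt(cv_err.shape[0])

# lambda_min: minimizer of the mean CV error
i_min = np.argmin(mean_err)
lambda_min = lambdas[i_min]

# lambda_1se: largest lambda whose mean error is within one SE of the minimum
within = mean_err <= mean_err[i_min] + se_err[i_min]
lambda_1se = lambdas[within].max()
```

By construction \(\lambda_{\text{1se}} \geq \lambda_{\text{min}}\), so the one-SE rule always selects a model at least as sparse as the minimizer.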
Any statistical analysis depends on the particular data observed. Different data would yield different results.
The bootstrap estimates this variability without collecting new data.
In regularized regression, we quantify uncertainty in both predictions \(\hat{y}_i\) and coefficients \(\hat{\beta}_j\).
Simulate new datasets by resampling the original data with replacement. For each bootstrap iteration \(b = 1, \dots, B\):
\[\begin{align*} \mathbf{y}^{b} = \begin{pmatrix}y_{1}^{b} \\ \vdots \\ y_{n}^{b}\end{pmatrix}, \quad \mathbf{X}^{b} = \begin{pmatrix}x_{1}^{b} \\ \vdots \\ x_{n}^{b}\end{pmatrix} \end{align*}\]
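A Python sketch of this resampling loop for lasso coefficients (scikit-learn's `Lasso` as the solver; \(B\), the penalty, and the data-generating setup are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, B = 100, 5, 200
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)

boot_betas = np.zeros((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    Xb, yb = X[idx], y[idx]           # (X^b, y^b)
    boot_betas[b] = Lasso(alpha=0.1).fit(Xb, yb).coef_

# Percentile intervals summarize coefficient uncertainty
ci = np.percentile(boot_betas, [2.5, 97.5], axis=0)
```

The spread of `boot_betas` across iterations estimates the sampling variability we would see with fresh data.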
These were the properties of intrinsically interpretable models we introduced earlier.
The \(\ell^{1}\)-penalized linear model focuses our attention on a small subset of selected features where \(\hat{\beta}_{j} \neq 0\).
The larger the \(\lambda\), the more interpretable the model becomes, since the support set \(S = \{j : \hat{\beta}_{j} \neq 0\}\) shrinks.
A linear model \(\hat{y} = \hat{\beta}_{0} + \sum_{j \in S} x_j \hat{\beta}_{j}\) is simulatable when \(|S|\) is small.
But for even moderate to large \(|S|\), simulatability becomes a challenge even for linear regression (Slack et al. 2019)!
The model is additive: \(f(x) = \sum_{j = 1}^{J}f_j(x_j)\).
We can inspect the influence of feature \(j\) using a single coefficient \(\beta_{j}\).
But even in linear regression, the ceteris paribus interpretation means each coefficient must be understood while holding all other features in the model fixed.
Global: \(\hat{\beta}\) helps us identify the most influential features across the study.
Local: For sample \(i\), the \(x_{ij}\hat{\beta}_{j}\) terms help us understand its particular prediction.
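Both readouts can be sketched with plain numpy; the fitted intercept, coefficients, and sample below are hypothetical.

```python
import numpy as np

# Hypothetical fitted sparse linear model
beta0 = 0.5
beta_hat = np.array([2.0, 0.0, -1.3, 0.0])  # sparse coefficient vector
x_i = np.array([1.2, 0.7, -0.4, 2.1])       # one sample's features

# Global: the support set identifies the influential features
S = np.flatnonzero(beta_hat)

# Local: per-feature contributions x_ij * beta_j to this sample's prediction
contrib = x_i * beta_hat
y_hat_i = beta0 + contrib.sum()
```

Ranking the entries of `contrib` explains which features drove this particular prediction.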
Response (\(y\)): Cell viability after drug treatment.
Predictors (\(X\)): High-dimensional molecular profiles \(N = 121, J = 9553\).
Goal: Identify a sparse set of features that distinguish drug sensitive vs. resistant cells.
We use the glmnet package in R:
# Fit the lasso path and perform 10-fold CV
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10, standardize = TRUE)
This solves the optimization problem for a sequence of \(\lambda\) values and estimates cross-validation error across \(K = 10\) folds.
alpha = 1 is lasso; alpha = 0 is ridge.
Respond to [Evaluation Critique] in the exercise sheet.
It’s important to transform \(x_{ij} \to \frac{x_{ij} - \bar{x}_j}{\hat{\sigma}_j}\).
The penalty \(\lambda \sum |\beta_j|\) treats all \(\beta_j\) equally. If \(x_j\) has a large scale, its \(\beta_j\) will be small, making it “cheaper” for the penalty to keep it in the model.
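A Python sketch of this scale effect (scikit-learn's `Lasso` with an arbitrary penalty; the single-feature setup is illustrative): the same feature measured on a 100× larger scale needs a 100× smaller coefficient, so the same \(\ell^1\) penalty barely shrinks it.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=(n, 1))
y = x[:, 0] + rng.normal(scale=0.1, size=n)  # true slope 1 on original scale

fit_raw = Lasso(alpha=0.5).fit(x, y)          # noticeably shrunk
fit_big = Lasso(alpha=0.5).fit(x * 100, y)    # same feature, inflated scale

# On the inflated scale, coef * 100 recovers nearly the full slope:
# the penalty charged this feature almost nothing.
x_std = (x - x.mean(axis=0)) / x.std(axis=0)  # standardization removes the artifact
fit_std = Lasso(alpha=0.5).fit(x_std, y)
```

Standardizing first puts every feature on the same footing, so the penalty compares coefficients fairly.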
We can visualize how the ridge coefficients \(\hat{\beta}_j\) evolve as \(\lambda\) decreases.
Each line represents a gene (RNA or methylation feature). The order in which they “emerge” from zero indicates their relative importance in predicting drug response.
These are the equivalent results for the lasso. Notice that coefficients stay at exactly 0 for many \(\lambda\).
Sparse linear models narrowed us down from 9K+ features to 37 that are enough to predict drug sensitivity with relatively high accuracy.
In this problem, standardization was essential. We are lucky that it is the default in glmnet, but this isn’t always the case.
The original scale of the features might be informative. By standardizing features, we might be giving undue weight to noise features with low variability.
Are there distinct gene pathway signatures for different drugs? In this case, we should consider either drug-pathway interaction terms or drug-specific classifiers.
The Problem: When predictors are highly correlated, the \(\hat{\beta}\) estimated by lasso becomes unstable.
Small changes in the training data can cause the lasso to switch between two highly correlated genes, leading to conflicting interpretations of the same underlying signal.
We simulated data where \(\text{Cor}\left(x_{2}, x_{3}\right) > 0.95\). The true coefficients are
These are the feature selection frequencies across 1000 bootstrap iterations. Genes 1 and 2 compete for selection.
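A Python sketch of this instability experiment (the pair `x1`, `x2` stands in for the correlated genes; the penalty, correlation strength, and \(B\) are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, B = 150, 300
z = rng.normal(size=n)                     # shared latent signal
x1 = z + rng.normal(scale=0.1, size=n)     # Cor(x1, x2) > 0.95
x2 = z + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)                    # independent noise feature
X = np.column_stack([x1, x2, x3])
y = z + rng.normal(scale=0.5, size=n)      # signal lives in the shared factor

selected = np.zeros((B, 3), dtype=bool)
for b in range(B):
    idx = rng.integers(0, n, size=n)       # bootstrap resample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    selected[b] = coef != 0

freq = selected.mean(axis=0)  # per-feature selection frequency
```

Across resamples, the lasso flips between the two correlated features, while neither is reliably selected on its own; selection frequency, not a single fit, is the honest summary.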
Selection \(\neq\) Importance
More generally, we should be careful applying sparse models to correlated data. There may be many plausible, equally predictive explanations based on different subsets of features.
Often, people assume that to get higher accuracy, we need “black-box” models (like deep learning or Random Forests) and sacrifice interpretability.
But in many scientific problems (like today’s case study; see also (“Benchmarks - Open Problems in Single-Cell Analysis — Openproblems.bio”)), sparse linear models are surprisingly competitive while remaining interpretable to a broader audience. Whether they suffice depends on the true relationship between predictors and response in the data.
Respond to [Code Analysis] in the exercise sheet.